28 research outputs found

Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning

Author: Majumdar Arjun, Mascharka David, Soklaski Ryan, Tran Philip
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 02/07/2018

Visual question answering requires high-order reasoning about an image, which is a fundamental capability needed by machine systems to follow complex directives. Recently, modular networks have been shown to be an effective framework for performing visual reasoning tasks. While modular networks were initially designed with a degree of model transparency, their performance on complex visual reasoning benchmarks was lacking. Current state-of-the-art approaches do not provide an effective mechanism for understanding the reasoning process. In this paper, we close the performance gap between interpretable models and state-of-the-art visual reasoning methods. We propose a set of visual-reasoning primitives which, when composed, manifest as a model capable of performing complex reasoning tasks in an explicitly-interpretable manner. The fidelity and interpretability of the primitives' outputs enable an unparalleled ability to diagnose the strengths and weaknesses of the resulting model. Critically, we show that these primitives are highly performant, achieving state-of-the-art accuracy of 99.1% on the CLEVR dataset. We also show that our model is able to effectively learn generalized representations when provided a small amount of data containing novel object attributes. Using the CoGenT generalization task, we show more than a 20 percentage point improvement over the current state of the art.
Comment: CVPR 2018 pre-prin

arXiv.org e-Print Archive

Crossref

Behavioral Analysis of Vision-and-Language Navigation Agents

Author: Lee Stefan, Majumdar Arjun, Yang Zijiao
Publication date: 20/07/2023

To be successful, Vision-and-Language Navigation (VLN) agents must be able to ground instructions to actions based on their surroundings. In this work, we develop a methodology to study agent behavior on a skill-specific basis -- examining how well existing agents ground instructions about stopping, turning, and moving towards specified objects or rooms. Our approach is based on generating skill-specific interventions and measuring changes in agent predictions. We present a detailed case study analyzing the behavior of a recent agent and then compare multiple agents in terms of skill-specific competency scores. This analysis suggests that biases from training have lasting effects on agent behavior and that existing models are able to ground simple referring expressions. Our comparisons between models show that skill-specific scores correlate with improvements in overall VLN task performance.
Comment: accepted to CVPR202

arXiv.org e-Print Archive

Analysis of General Aviation fixed-wing aircraft accidents involving inflight loss of control using a state-based approach

Author: Majumdar Neelakshi, Marais Karen, Rao Arjun
Publication venue: 'Vilnius Gediminas Technical University'
Publication date: 21/12/2021

Inflight loss of control (LOC-I) is a significant cause of General Aviation (GA) fixed-wing aircraft accidents. The United States National Transportation Safety Board’s database provides a rich source of accident data, but conventional analyses of the database yield limited insights into LOC-I. We investigate the causes of 5,726 LOC-I fixed‑wing GA aircraft accidents in the United States in 1999–2008 and 2009–2017 using a state-based modeling approach. The multi-year analysis helps discern changes in causation trends over the last two decades. Our analysis highlights LOC-I causes such as pilot actions and mechanical issues that were not discernible in previous research efforts. The logic rules in the state-based approach help infer missing information from the National Transportation Safety Board (NTSB) accident reports. We inferred that 4.84% (1999–2008) and 7.46% (2009–2017) of LOC-I accidents involved a preflight hazardous aircraft condition. We also inferred that 20.11% (1999–2008) and 19.59% (2009–2017) of LOC-I accidents happened because the aircraft hit an object or terrain. By removing redundant coding and identifying when codes are missing, the state-based approach potentially provides a more consistent way of coding accidents compared to the current coding system.

VGTU Journals (Vilnius Gediminas Technical University - Vilnius Tech)

A Review on working of Cflow

Author: Madhuri Bhalekar, Arjun Jain, Debajyoti Majumdar, Nigrah Bamb, Ajinkya Shendre
Publication venue: 'Auricle Technologies, Pvt., Ltd.'
Publication date: 30/06/2016

In the field of program analysis, call graphs provide a succinct, human-readable visual form of function flows in a program. Typically, call graphs are directed graphs that show the sequence in which subroutines are invoked, depicting caller-callee dependencies. They are used to trace the flow a program takes during execution, laying a foundation for further analysis. Call graph generators take a program as input and produce its call graph; GNU Cflow is one such tool. It accepts one or more C programs as input and generates a procedure flow with a clear caller-callee sequence distinguished by indentation level, with callee functions indented inside caller functions. This output can be altered by supplying different flags and output-formatting options to suit the requirement. There is considerable scope to revamp the Cflow source code and make use of its output. In this paper, we discuss the nature of cflow, its expected output, its limitations and scope for future research in it.

International Journal on Recent and Innovation Trends in Computing and Communication

ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

Author: Aggarwal Gunjan, Batra Dhruv, Devnani Bhavika, Hoffman Judy, Majumdar Arjun
Publication date: 12/10/2023

We present a scalable approach for learning open-world object-goal navigation (ObjectNav) -- the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot -- i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., "sink", "bathroom sink", etc.) by projecting language goals into the same multimodal, semantic embedding space. As a result, our approach enables open-world ObjectNav. We extensively evaluate our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2% - 20.0% over existing zero-shot methods. For reference, these gains are similar or better than the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge winners. In an open-world setting, we discover that our agents can generalize to compound instructions with a room explicitly mentioned (e.g., "Find a kitchen sink") and when the target room can be inferred (e.g., "Find a sink and a stove").
Comment: code: https://github.com/gunagg/zso

arXiv.org e-Print Archive

OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav

Author: Baevski Alexei, Batra Dhruv, Kira Zsolt, Majumdar Arjun, Maksymets Oleksandr, Ramrakhya Ram, Yadav Karmesh, Yokoyama Naoki
Publication date: 14/03/2023

We present a single neural network architecture composed of task-agnostic components (ViTs, convolutions, and LSTMs) that achieves state-of-art results on both the ImageNav ("go to location in ") and ObjectNav ("find a chair") tasks without any task-specific modules like object detection, segmentation, mapping, or planning modules. Such general-purpose methods offer advantages of simplicity in design, positive scaling with available compute, and versatile applicability to multiple tasks. Our work builds upon the recent success of self-supervised learning (SSL) for pre-training vision transformers (ViT). However, while the training recipes for convolutional networks are mature and robust, the recipes for ViTs are contingent and brittle, and in the case of ViTs for visual navigation, yet to be fully discovered. Specifically, we find that vanilla ViTs do not outperform ResNets on visual navigation. We propose the use of a compression layer operating over ViT patch representations to preserve spatial information along with policy training improvements. These improvements allow us to demonstrate positive scaling laws for the first time in visual navigation tasks. Consequently, our model advances state-of-the-art performance on ImageNav from 54.2% to 82.0% success and performs competitively against concurrent state-of-art on ObjectNav with success rate of 64.0% vs. 65.0%. Overall, this work does not present a fundamentally new approach, but rather recommendations for training a general-purpose architecture that achieves state-of-art performance today and could serve as a strong baseline for future methods.
Comment: 15 pages, 7 figures, 9 table

arXiv.org e-Print Archive

The International Workshop on Osteoarthritis Imaging Knee MRI Segmentation Challenge: A Multi-Institute Evaluation and Analysis Framework on a Standardized Dataset

Purpose: To organize a knee MRI segmentation challenge for characterizing the semantic and clinical efficacy of automatic segmentation methods relevant for monitoring osteoarthritis progression. Methods: A dataset partition consisting of 3D knee MRI from 88 subjects at two timepoints with ground-truth articular (femoral, tibial, patellar) cartilage and meniscus segmentations was standardized. Challenge submissions and a majority-vote ensemble were evaluated using Dice score, average symmetric surface distance, volumetric overlap error, and coefficient of variation on a hold-out test set. Similarities in network segmentations were evaluated using pairwise Dice correlations. Articular cartilage thickness was computed per-scan and longitudinally. Correlation between thickness error and segmentation metrics was measured using Pearson's coefficient. Two empirical upper bounds for ensemble performance were computed using combinations of model outputs that consolidated true positives and true negatives. Results: Six teams (T1-T6) submitted entries for the challenge. No significant differences were observed across all segmentation metrics for all tissues (p=1.0) among the four top-performing networks (T2, T3, T4, T6). Dice correlations between network pairs were high (>0.85). Per-scan thickness errors were negligible among T1-T4 (p=0.99) and longitudinal changes showed minimal bias (<0.03mm). Low correlations (<0.41) were observed between segmentation metrics and thickness error. The majority-vote ensemble was comparable to top performing networks (p=1.0). Empirical upper bound performances were similar for both combinations (p=1.0). Conclusion: Diverse networks learned to segment the knee similarly where high segmentation accuracy did not correlate to cartilage thickness accuracy. Voting ensembles did not outperform individual networks but may help regularize individual models.
Comment: Submitted to Radiology: Artificial Intelligence; Fixed typo

arXiv.org e-Print Archive

Copenhagen University Research Information System

Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

Author: Abbeel Pieter, Arnaud Sergio, Batra Dhruv, Berges Vincent-Pierre, Chen Claire, Jain Aryan, Lin Yixin, Ma Yecheng Jason, Majumdar Arjun, Maksymets Oleksandr, Malik Jitendra, Meier Franziska, Rajeswaran Aravind, Silwal Sneha, Yadav Karmesh
Publication date: 31/03/2023

We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual 'foundation models' for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none are universally dominant. To study the effect of pre-training data scale and diversity, we combine over 4,000 hours of egocentric videos from 7 different sources (over 5.6M images) and ImageNet to train different-sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior work, we find that scaling dataset size and diversity does not improve performance universally (but does so on average). Our largest model, named VC-1, outperforms all prior PVRs on average but does not universally dominate either. Finally, we show that task or domain-specific adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving competitive or superior performance than the best known results on all of the benchmarks in CortexBench. These models required over 10,000 GPU-hours to train and can be found on our website for the benefit of the research community.
Comment: Project website: https://eai-vc.github.i

arXiv.org e-Print Archive